Representing Text Documents in Training Document Spaces: a Novel Model for Document Representation
Abstract
In this paper, we propose a novel model for document representation that addresses the high dimensionality and vector sparseness commonly faced in text classification tasks. The model first represents text documents in the space of training documents. The resulting vectors are then projected into a new space whose number of dimensions equals the number of categories. To evaluate the model, we focus on a binary classification problem and run experiments on Arabic and English opinion mining data sets. As classifiers we use Support Vector Machines (SVM) and k-Nearest Neighbors (k-NN), which are known for their effectiveness in classical text classification tasks. We compare our model with the classical Vector Space Model (VSM) on three criteria: the dimensionality of the generated vectors, the learning and testing time taken by the classifiers, and classification accuracy. Our experiments show that the effectiveness of our model relative to the classical VSM depends on the classifier used. With k-NN, our model yields results that are better than or equal to those obtained with the classical VSM. With SVM, the results are in general slightly lower than those obtained with VSM. However, the gains in time and dimensionality are promising, since both are dramatically reduced by our model.
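The two-stage representation described in the abstract can be illustrated with a minimal sketch. The abstract does not say which similarity measure or which projection the authors use, so everything below is an assumption made for illustration: cosine similarity over raw term frequencies for the first stage, and the mean similarity to each category's training documents for the second. The function names and the toy data are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def tf_vector(text):
    """Bag-of-words term frequencies (assumed tokenizer: lowercase + split)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def to_training_space(doc, train_vectors):
    """Stage 1: one dimension per training document (assumed: cosine similarity)."""
    v = tf_vector(doc)
    return [cosine(v, t) for t in train_vectors]


def to_category_space(sims, train_labels, categories):
    """Stage 2: one dimension per category (assumed: mean similarity per category)."""
    totals = {c: 0.0 for c in categories}
    counts = {c: 0 for c in categories}
    for s, label in zip(sims, train_labels):
        totals[label] += s
        counts[label] += 1
    return [totals[c] / counts[c] if counts[c] else 0.0 for c in categories]


# Toy opinion-mining data, invented for this sketch.
train_docs = [
    "great film loved it",
    "wonderful acting great story",
    "terrible plot hated it",
    "boring and terrible film",
]
train_labels = ["pos", "pos", "neg", "neg"]
categories = ["pos", "neg"]
train_vectors = [tf_vector(d) for d in train_docs]


def represent(doc):
    """Map a document to a vector with len(categories) dimensions."""
    sims = to_training_space(doc, train_vectors)
    return to_category_space(sims, train_labels, categories)


vec = represent("great story loved the acting")
print(vec)
```

Under these assumptions every document ends up as a vector with as many dimensions as there are categories (two in the binary case), regardless of vocabulary size. That is the dimensionality reduction the abstract refers to, and the resulting vectors can be passed to any classifier such as k-NN or SVM.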
Similar resources
A New Document Embedding Method for News Classification
Abstract: Text classification is one of the main tasks of natural language processing (NLP). In this task, documents are classified into pre-defined categories. A large volume of news spreads on the web. A text classifier can categorize news automatically, which facilitates and accelerates access to the news. The first step in text classification is to represent documents in a suitable way t...
A Joint Semantic Vector Representation Model for Text Clustering and Classification
Text clustering and classification are two main tasks of text mining. Feature selection plays a key role in the quality of the clustering and classification results. Although word-based features such as term frequency-inverse document frequency (TF-IDF) vectors have been widely used in different applications, their shortcoming in capturing semantic concepts of text motivated researchers to use...
Ontology-Based Document Clustering with a Fuzzy Approach
Data mining, also known as knowledge discovery in databases, is the process of discovering unknown knowledge from a large amount of data. Text mining applies data mining techniques to extract knowledge from unstructured text. Text clustering, the unsupervised classification of similar documents into different groups, is one of the important techniques of text mining. The most important step...
Concept Chain Based Text Clustering
Unlike familiar clustering objects, text documents have sparse data spaces. A common way of representing a document is as a bag of its component words, but this ignores the semantic relations between words. In this paper, we propose a novel document representation approach to strengthen the discriminative features of document objects. We replace terms of documents with concepts in WordNet...
Learning Document Image Features With SqueezeNet Convolutional Neural Network
The classification of various document images is considered an important step towards building a modern digital library or office automation system. Convolutional Neural Network (CNN) classifiers trained with backpropagation are considered to be the current state-of-the-art model for this task. However, these classifiers have two major drawbacks: the huge computational power demand for...
Publication date: 2013